library(tidyverse)
library(psych)
library(lavaan)
library(lme4)
library(data.table)
library(readxl)
library(magrittr)
library(haven)
library(stargazer)
library(GGally)
library(car)
library(kableExtra)
library(yarrr)
library(lsmeans)
library(broom)
library(multcomp)
library(lsr)
library(apaTables)
library(MBESS)
Every once in a while R releases a new version of the software. It will not update automatically, so you should check the R website https://www.r-project.org/ from time to time, or follow them on Twitter. You can see which version you are currently running by entering version into the console.
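For example, either of these prints your current version (the exact output will depend on your installation):
version
R.version.string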
#install.packages() # downloads and installs a package from CRAN (run once per machine)
#require()          # loads a package, returning FALSE rather than an error if it is missing
#library()          # loads a package for the current session
Calling ?function_name
allows you to look at the arguments for a function as well as their defaults. This is really helpful when you are using a new function, want to know what abilities a function has, or need to troubleshoot.
?read.csv
There are three main ways to read in csv files. First, you can use base R with the function read.csv. Alternatively, you can use the read_csv function from the readr package. The final way is with the fread function from the data.table package.
# Base R
iris.1<- read.csv("./data/iris.csv",row.names=NULL)
# Readr
iris.2<- read_csv("./data/iris.csv")
# data.table
iris.3<- fread("./data/iris.csv")
For Excel files, you can use the readxl package. Note: if you have multiple sheets you will need to specify which sheet you want to read in (i.e., read_excel("<filename>", sheet = "<sheetname>")).
readxl_example("deaths.xlsx")
## [1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library/readxl/extdata/deaths.xlsx"
Deaths<-read_excel(readxl_example("deaths.xlsx"))
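If you are not sure what the sheets are called, excel_sheets() lists them. A quick sketch; I believe the first sheet in this example workbook is named "arts", but treat the sheet name as illustrative:
excel_sheets(readxl_example("deaths.xlsx"))
Deaths_arts<- read_excel(readxl_example("deaths.xlsx"), sheet = "arts")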
In this example we are downloading movie data from GitHub. Note that this isn’t web scraping; we are merely accessing a previously compiled dataset, which in this case is in csv format. Reading in data from a website is helpful if you are accessing archival datasets and the like. We won’t get into APIs just yet, but this is the first step in that direction.
movies <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/fandango/fandango_score_comparison.csv",header=TRUE)
head(movies)
Sometimes data is saved in a format where values are separated by tabs or some other delimiter. We can use read.table, which allows us to specify how the file is delimited.
insect<- read.table(file = 'http://www.ucd.ie/ecomodel/Resources/INSECT.TXT',
sep = '\t',
header = TRUE,
skip=3)
For SPSS files I prefer to use the haven package. Another alternative is the foreign package, but it is a little less consistent than I would prefer.
survey<- read_sav("~/Downloads/survey.sav")
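One thing to know about haven is that SPSS value labels come in as labelled vectors rather than factors; haven’s as_factor() converts them. A hedged sketch, since whether a given column carries labels depends on the .sav file:
survey %>% mutate(source = as_factor(source))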
Getting a sense of the structure and characteristics of a dataset is an important step. There are a few different ways we can examine a dataset.
We can view the first few rows of a dataset with head, or the last few rows with tail. Both default to 6 rows, but you can adjust that by passing a number to the second argument.
head(iris.1)
tail(iris.1)
head(iris.1,20)
We can use describe from the psych package.
psych::describe(iris.1)
We can print a descriptives table and get a head start on putting together our final documentation with stargazer.
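A minimal sketch: handing stargazer a data frame produces a summary-statistics table in the console, and the out argument writes it to a file (the output path here is hypothetical):
stargazer(iris.1, type = "text")
stargazer(iris.1, type = "html", out = "./output/descriptives.html")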
We can use summary from base R.
summary(iris.1)
## X Sepal.Length Sepal.Width Petal.Length
## Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median : 75.50 Median :5.800 Median :3.000 Median :4.350
## Mean : 75.50 Mean :5.843 Mean :3.057 Mean :3.758
## 3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
## Petal.Width Species
## Min. :0.100 setosa :50
## 1st Qu.:0.300 versicolor:50
## Median :1.300 virginica :50
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
The Tidyverse is a collection of packages for reading, manipulating, analyzing, visualizing, and reporting in R. It has gained widespread popularity and rivals base R. Many people either strictly adhere to tidyverse packages or stick to base R. For this reason, you may see people do the same thing in different ways, and they may recommend different solutions based on their preferred packages. Ultimately this points to one of the advantages of R: there is rarely a single right answer.
I think that the tidyverse is intuitive, consistent across packages, flexible, and in many instances better performing (i.e., faster) than base R. Even if you choose not to use Tidyverse packages, I think some of the functions are helpful to know.
Let’s start with the basic pipe operator %>%. The pipe operator allows you to run multiple functions on a dataset in succession while cutting down on the number of lines of code and cleaning up your code’s appearance. I like to think of the pipe as meaning “and then”.
Here we are calling the iris.1 dataset, and then grouping that dataset by Species, and then summarising those groups by calculating the mean. Because I do not assign this to a variable, the output prints to the console without changing the data. One thing you will notice is that the pipe passes the result on its left to the first argument of the next function, so I am not specifying a dataset for the two functions; it knows I’m referring to iris.1. Also remember that the functions are applied consecutively: it’s as if the result of each step is stored in a temporary dataset and the next function is performed on that new dataset (not the original one).
iris.1 %>%
group_by(Species) %>%
summarise(mean = mean(Petal.Length))
With the %<>% assignment pipe, we can save the results back to the object. This is equivalent to writing iris.1 <- iris.1 %>% ... followed by the same verbs. Here:
iris.1 %<>%
group_by(Species) %>%
summarise(mean = mean(Petal.Length))
Dplyr is a package containing functions that allow us to manipulate data. Most if not all of the Dplyr verbs have a selection of scoped variants, including _if, _at, and _all. Each function has some of its own unique scoped variants as well.
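For example, summarise_if() applies a function to every column that passes a test, while mutate_at() transforms only the columns you name. A small sketch of the scoped pattern:
# mean of every numeric column
iris.1 %>%
  summarise_if(is.numeric, mean)
# rescale just the two sepal columns
iris.1 %>%
  mutate_at(vars(Sepal.Length, Sepal.Width), ~ . / 10)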
filter allows us to.. well.. filter our data based on a condition that we define. You can use one criterion, you can filter on one criterion or another with filter(Sepal.Length >= 5 | Sepal.Width < 3), or you can require both of two conditions with filter(Sepal.Length >= 5 & Sepal.Width < 3). Both compound versions are shown after the single-criterion example below.
iris.2 %>%
filter(Sepal.Length >= 5)
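The two compound filters mentioned above:
# keep rows matching either condition
iris.2 %>%
  filter(Sepal.Length >= 5 | Sepal.Width < 3)
# keep rows matching both conditions
iris.2 %>%
  filter(Sepal.Length >= 5 & Sepal.Width < 3)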
The select function allows us to remove columns, or pick out columns for a new dataset.
You can pass a list of column names (no quotes necessary), you can use : to specify the first and last column of a range, or you can negatively subset to remove columns.
Select also comes with some unique helpers which are useful for dealing with named columns, including starts_with, ends_with, and contains.
demographics<-survey %>%
dplyr::select(id:educ)
demographics<- survey %>%
dplyr::select(-source)
iris %>%
dplyr::select(starts_with("Sepal"))
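The other two helpers work the same way:
iris %>%
  dplyr::select(contains("Width"))
iris %>%
  dplyr::select(ends_with("Length"))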
Select is also valuable for reordering columns:
iris %>%
dplyr::select(Sepal.Length,everything())
The mutate function allows us to compute new variables. By default mutate operates on whole columns. You can make it iterate over rows with rowwise(); however, that approach has been discouraged by developers such as Hadley Wickham, who suggest using purrr instead.
survey %>%
mutate(year = 2015) %>%
dplyr::select(id,year)
Transmute is similar to mutate but it only keeps the new variables by default.
survey %>%
  rowwise() %>%
  transmute(op = mean(c(op1,op2,op3,op4,op5,op6), na.rm = TRUE)) # na.rm must be an argument to mean(), not an element of c()
This is another package in the Tidyverse, but it does not focus on data manipulation. Rather, purrr is an invaluable resource for applying functions iteratively across lists and nested datasets. I’m going to keep fleshing this section out, so stay tuned.
iris %>%
mutate(Max.Len= purrr::pmap_dbl(list(Sepal.Length, Petal.Length), max))
# read the item-level Big Five data
reliability.data<- read_tsv("./data/data.csv")
# one data frame per trait; drop demographic columns that happen to share a prefix
Extraversion<- dplyr::select(reliability.data,starts_with("E"),-engnat)
Neuroticism<- dplyr::select(reliability.data,starts_with("N"))
Agreeableness<- dplyr::select(reliability.data,starts_with("A"),-age)
Openness<- dplyr::select(reliability.data,starts_with("O"))
Conscientiousness<- dplyr::select(reliability.data,starts_with("C"),-country)
measures<-list(Extraversion,Neuroticism,Agreeableness,Openness,Conscientiousness)
# Cronbach's alpha for each measure in one call
map(measures,psych::alpha)
## Some items ( E2 E4 E6 E8 E10 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( N2 N4 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( A1 A3 A5 A7 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( O2 O4 O6 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( C2 C4 C6 C8 ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## [[1]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.39 -0.39 0.26 -0.029 -0.28 0.016 3.1 0.35 -0.34
##
## lower alpha upper 95% confidence boundaries
## -0.42 -0.39 -0.36
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## E1 -0.49 -0.44 0.23 -0.035 -0.31 0.017 0.22 -0.34
## E2 -0.28 -0.32 0.28 -0.028 -0.24 0.014 0.21 -0.34
## E3 -0.48 -0.44 0.22 -0.035 -0.31 0.017 0.21 -0.34
## E4 -0.24 -0.26 0.31 -0.023 -0.21 0.014 0.21 -0.34
## E5 -0.36 -0.33 0.26 -0.028 -0.25 0.016 0.20 -0.34
## E6 -0.35 -0.38 0.26 -0.031 -0.27 0.015 0.23 -0.34
## E7 -0.46 -0.44 0.20 -0.035 -0.31 0.017 0.20 -0.34
## E8 -0.27 -0.30 0.31 -0.026 -0.23 0.014 0.23 -0.36
## E9 -0.34 -0.32 0.28 -0.028 -0.24 0.016 0.23 -0.34
## E10 -0.19 -0.22 0.34 -0.021 -0.18 0.013 0.22 -0.34
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## E1 19719 0.38 0.36 0.2892 0.030 2.6 1.2
## E2 19719 0.22 0.25 0.1153 -0.152 2.8 1.3
## E3 19719 0.37 0.35 0.3026 0.024 3.4 1.2
## E4 19719 0.15 0.20 0.0047 -0.192 3.2 1.2
## E5 19719 0.29 0.26 0.1814 -0.079 3.4 1.3
## E6 19719 0.27 0.30 0.1578 -0.087 2.5 1.2
## E7 19719 0.40 0.36 0.3807 -0.008 2.9 1.4
## E8 19719 0.20 0.23 -0.0179 -0.163 3.4 1.3
## E9 19719 0.30 0.26 0.0656 -0.100 3.1 1.4
## E10 19719 0.13 0.16 -0.1046 -0.236 3.6 1.3
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## E1 0 0.24 0.23 0.28 0.18 0.07 0
## E2 0 0.21 0.24 0.24 0.18 0.13 0
## E3 0 0.08 0.17 0.24 0.28 0.23 0
## E4 0 0.10 0.21 0.28 0.25 0.16 0
## E5 0 0.09 0.16 0.21 0.28 0.25 0
## E6 0 0.26 0.32 0.19 0.15 0.08 0
## E7 0 0.23 0.21 0.18 0.19 0.18 0
## E8 0 0.09 0.19 0.23 0.26 0.24 0
## E9 0 0.17 0.20 0.19 0.23 0.21 0
## E10 0 0.08 0.16 0.19 0.25 0.33 0
##
## [[2]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.71 0.69 0.79 0.19 2.3 0.0026 3.1 0.67 0.38
##
## lower alpha upper 95% confidence boundaries
## 0.71 0.71 0.72
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## N1 0.66 0.63 0.74 0.16 1.7 0.0031 0.142 0.36
## N2 0.80 0.80 0.84 0.30 3.9 0.0019 0.095 0.41
## N3 0.68 0.65 0.76 0.17 1.9 0.0030 0.150 0.39
## N4 0.79 0.78 0.83 0.28 3.5 0.0019 0.125 0.41
## N5 0.67 0.64 0.76 0.17 1.8 0.0030 0.159 0.39
## N6 0.64 0.61 0.73 0.15 1.6 0.0033 0.139 0.36
## N7 0.64 0.62 0.72 0.15 1.6 0.0033 0.142 0.37
## N8 0.63 0.61 0.71 0.15 1.6 0.0034 0.138 0.36
## N9 0.64 0.62 0.74 0.15 1.6 0.0033 0.145 0.36
## N10 0.67 0.65 0.75 0.17 1.8 0.0030 0.145 0.36
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## N1 19719 0.68 0.68 0.65 0.56 3.3 1.3
## N2 19719 -0.32 -0.30 -0.50 -0.46 3.2 1.2
## N3 19719 0.60 0.61 0.55 0.48 3.8 1.1
## N4 19719 -0.14 -0.13 -0.32 -0.31 2.8 1.2
## N5 19719 0.64 0.64 0.57 0.51 3.0 1.3
## N6 19719 0.77 0.77 0.76 0.67 3.0 1.3
## N7 19719 0.76 0.75 0.76 0.65 3.2 1.3
## N8 19719 0.78 0.77 0.79 0.68 2.8 1.4
## N9 19719 0.74 0.74 0.72 0.63 3.1 1.3
## N10 19719 0.64 0.63 0.59 0.50 2.8 1.3
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## N1 0 0.11 0.20 0.22 0.25 0.22 0
## N2 0 0.08 0.20 0.28 0.28 0.16 0
## N3 0 0.04 0.11 0.16 0.35 0.34 0
## N4 0 0.17 0.27 0.27 0.18 0.10 0
## N5 0 0.15 0.25 0.23 0.24 0.13 0
## N6 0 0.16 0.24 0.22 0.22 0.16 0
## N7 0 0.12 0.22 0.22 0.25 0.19 0
## N8 0 0.22 0.24 0.20 0.20 0.14 0
## N9 0 0.13 0.22 0.21 0.27 0.17 0
## N10 0 0.19 0.25 0.23 0.20 0.13 0
##
## [[3]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.024 0.053 0.38 0.0055 0.055 0.011 3.2 0.35 -0.15
##
## lower alpha upper 95% confidence boundaries
## -0.05 -0.02 0
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## A1 0.103 0.168 0.46 0.0219 0.202 0.0093 0.15 0.044
## A2 -0.073 -0.024 0.32 -0.0026 -0.024 0.0120 0.13 -0.205
## A3 0.123 0.199 0.48 0.0269 0.249 0.0093 0.15 0.064
## A4 -0.176 -0.144 0.22 -0.0142 -0.126 0.0131 0.11 -0.174
## A5 0.185 0.274 0.49 0.0403 0.378 0.0085 0.12 0.032
## A6 -0.203 -0.147 0.26 -0.0144 -0.128 0.0134 0.14 -0.174
## A7 0.187 0.278 0.48 0.0410 0.385 0.0085 0.12 0.032
## A8 -0.161 -0.119 0.28 -0.0119 -0.106 0.0129 0.13 -0.180
## A9 -0.247 -0.207 0.19 -0.0194 -0.171 0.0139 0.12 -0.174
## A10 -0.181 -0.124 0.29 -0.0124 -0.110 0.0132 0.14 -0.205
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## A1 19719 0.24410 0.142 -0.18 -0.145 2.3 1.4
## A2 19719 0.36191 0.415 0.36 0.062 3.9 1.1
## A3 19719 0.14985 0.086 -0.29 -0.192 2.2 1.2
## A4 19719 0.47230 0.543 0.65 0.197 4.0 1.0
## A5 19719 0.00741 -0.063 -0.39 -0.300 2.2 1.1
## A6 19719 0.50483 0.546 0.54 0.210 3.9 1.1
## A7 19719 -0.00089 -0.071 -0.37 -0.305 2.2 1.1
## A8 19719 0.45551 0.518 0.50 0.180 3.8 1.0
## A9 19719 0.54348 0.601 0.72 0.272 3.9 1.1
## A10 19719 0.47812 0.523 0.45 0.202 3.7 1.1
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## A1 0 0.39 0.25 0.13 0.14 0.10 0
## A2 0 0.03 0.08 0.17 0.34 0.37 0
## A3 0 0.40 0.26 0.17 0.13 0.05 0
## A4 0 0.03 0.07 0.14 0.36 0.40 0
## A5 0 0.34 0.35 0.16 0.10 0.05 0
## A6 0 0.04 0.09 0.18 0.31 0.38 0
## A7 0 0.34 0.35 0.17 0.10 0.05 0
## A8 0 0.03 0.09 0.22 0.39 0.26 0
## A9 0 0.04 0.08 0.14 0.37 0.37 0
## A10 0 0.03 0.09 0.29 0.34 0.25 0
##
## [[4]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.27 0.3 0.53 0.041 0.42 0.0075 3.3 0.38 0.14
##
## lower alpha upper 95% confidence boundaries
## 0.26 0.27 0.29
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## O1 0.075 0.12 0.39 0.016 0.14 0.0099 0.090 -0.00089
## O2 0.429 0.44 0.59 0.082 0.80 0.0057 0.086 0.19340
## O3 0.223 0.24 0.48 0.034 0.32 0.0083 0.095 -0.00089
## O4 0.381 0.41 0.58 0.071 0.68 0.0061 0.094 0.19340
## O5 0.133 0.14 0.41 0.017 0.16 0.0092 0.090 -0.00089
## O6 0.438 0.47 0.61 0.089 0.87 0.0058 0.088 0.19340
## O7 0.183 0.20 0.47 0.027 0.25 0.0087 0.099 0.01249
## O8 0.075 0.14 0.40 0.017 0.16 0.0100 0.095 -0.00089
## O9 0.218 0.24 0.52 0.035 0.32 0.0083 0.111 0.00937
## O10 0.144 0.15 0.41 0.019 0.18 0.0091 0.083 -0.00089
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## O1 19719 0.627 0.613 0.64 0.40 3.7 1.12
## O2 19719 0.011 -0.031 -0.27 -0.27 2.1 1.14
## O3 19719 0.403 0.430 0.35 0.15 4.1 1.01
## O4 19719 0.114 0.077 -0.15 -0.17 2.1 1.11
## O5 19719 0.552 0.597 0.61 0.35 3.9 0.94
## O6 19719 -0.064 -0.097 -0.37 -0.32 1.8 1.07
## O7 19719 0.465 0.505 0.42 0.25 4.1 0.92
## O8 19719 0.629 0.595 0.60 0.36 3.2 1.26
## O9 19719 0.409 0.429 0.25 0.17 4.1 0.98
## O10 19719 0.535 0.577 0.60 0.31 4.0 0.98
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## O1 0 0.05 0.10 0.24 0.33 0.28 0
## O2 0 0.36 0.31 0.19 0.09 0.04 0
## O3 0 0.02 0.06 0.15 0.31 0.46 0
## O4 0 0.39 0.29 0.21 0.07 0.04 0
## O5 0 0.02 0.05 0.25 0.40 0.28 0
## O6 0 0.53 0.27 0.11 0.06 0.04 0
## O7 0 0.01 0.05 0.17 0.40 0.38 0
## O8 0 0.12 0.18 0.25 0.27 0.18 0
## O9 0 0.02 0.05 0.14 0.34 0.44 0
## O10 0 0.02 0.06 0.20 0.34 0.38 0
##
## [[5]]
##
## Reliability analysis
## Call: .f(x = .x[[i]])
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.056 0.11 0.36 0.012 0.12 0.01 3.2 0.39 -0.16
##
## lower alpha upper 95% confidence boundaries
## 0.03 0.06 0.08
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## C1 -0.017 0.0077 0.28 0.00086 0.0078 0.0114 0.096 -0.163
## C2 0.105 0.1796 0.39 0.02374 0.2189 0.0097 0.100 0.016
## C3 -0.033 -0.0102 0.29 -0.00112 -0.0101 0.0116 0.111 -0.203
## C4 0.120 0.2053 0.41 0.02790 0.2584 0.0096 0.095 0.028
## C5 0.086 0.1008 0.34 0.01230 0.1121 0.0102 0.094 -0.163
## C6 0.140 0.2044 0.40 0.02775 0.2569 0.0092 0.093 0.028
## C7 -0.015 0.0129 0.30 0.00145 0.0130 0.0114 0.104 -0.163
## C8 0.162 0.2457 0.44 0.03493 0.3257 0.0093 0.102 0.028
## C9 -0.037 -0.0092 0.27 -0.00102 -0.0091 0.0115 0.096 -0.163
## C10 -0.056 -0.0343 0.27 -0.00370 -0.0332 0.0118 0.106 -0.180
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## C1 19719 0.39 0.457 0.419 0.120 3.3 1.1
## C2 19719 0.30 0.210 0.009 -0.052 3.0 1.4
## C3 19719 0.40 0.478 0.384 0.156 4.0 1.0
## C4 19719 0.24 0.165 -0.058 -0.083 2.7 1.2
## C5 19719 0.29 0.333 0.222 -0.032 2.7 1.2
## C6 19719 0.27 0.166 -0.032 -0.096 2.9 1.4
## C7 19719 0.40 0.451 0.368 0.112 3.6 1.2
## C8 19719 0.13 0.089 -0.228 -0.162 2.5 1.1
## C9 19719 0.44 0.477 0.461 0.132 3.2 1.2
## C10 19719 0.44 0.506 0.456 0.192 3.6 1.0
##
## Non missing response frequency for each item
## 0 1 2 3 4 5 miss
## C1 0 0.06 0.17 0.29 0.34 0.14 0
## C2 0 0.19 0.21 0.19 0.24 0.16 0
## C3 0 0.02 0.07 0.17 0.37 0.36 0
## C4 0 0.21 0.29 0.24 0.18 0.09 0
## C5 0 0.21 0.26 0.26 0.18 0.10 0
## C6 0 0.21 0.23 0.17 0.22 0.17 0
## C7 0 0.06 0.11 0.23 0.33 0.27 0
## C8 0 0.24 0.26 0.32 0.13 0.05 0
## C9 0 0.11 0.19 0.25 0.28 0.18 0
## C10 0 0.03 0.09 0.31 0.35 0.22 0
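The warnings above flag negatively keyed items; per psych's own suggestion, rerunning alpha with check.keys = TRUE reverse-scores them automatically (output omitted here):
map(measures, ~ psych::alpha(.x, check.keys = TRUE))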
describe(survey)
This is done with the purrr package, which is the Tidyverse version of apply. It allows us to iterate over rows, columns, and dataframes in a clean and concise manner. You can also do this with rowwise() and mutate, but again Hadley has issues with that.
Extraversion %>% mutate(score = pmap_dbl(.,function(...) mean(c(...)))) # row mean across all ten items
Here I’ll demonstrate a very basic regression. The point is not the complexity of the statistics, but rather to show how to go through every phase of an analysis.
I need to create a function here, which I use when I clean the data.
replace_over_4<- function(x) replace(x,x>4,NA) # recode anything above 4 (out-of-range values) as NA
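A quick check of what it does:
replace_over_4(c(1, 3, 4, 5, 97))
## [1]  1  3  4 NA NA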
I do this all in one pipe. Proceed at your own risk.
health_tbl<- read_tsv("./data/w4inhome_dvn.tab") %>%
  # keep and rename only the variables we need
  dplyr::select(mh1 = H4MH3, mh2 = H4MH4, mh3 = H4MH5, mh4 = H4MH6,
                jailage = H4CJ20, gender = BIO_SEX4) %>%
  # recode out-of-range values on the mental health items
  mutate_at(vars(mh1:mh4),replace_over_4) %>%
  # reverse-score the first and fourth items
  mutate(mh1 = 4 - mh1,
         mh4 = 4 - mh4) %>%
  rowwise() %>%
  # composite mental health score for each respondent
  mutate(mh = mean(c(mh1,mh2,mh3,mh4),na.rm=T)) %>%
  ungroup() %>%
  mutate(gender = factor(gender,levels =c(1,2),labels = c("male","female"))) %>%
  filter(jailage < 97) # values of 97 and up are missing-data codes
dplyr::select(health_tbl, 5:7) %>% ggpairs()
dplyr::select(health_tbl, -gender) %>% cor(use ="pairwise")
## mh1 mh2 mh3 mh4 jailage mh
## mh1 1.0000000 0.18972172 0.33470223 0.4286938 0.07206430 0.74075325
## mh2 0.1897217 1.00000000 0.27916435 0.2129501 -0.03611862 0.57331576
## mh3 0.3347022 0.27916435 1.00000000 0.4123384 0.04372349 0.70121523
## mh4 0.4286938 0.21295010 0.41233840 1.0000000 0.02069340 0.75779718
## jailage 0.0720643 -0.03611862 0.04372349 0.0206934 1.00000000 0.04231706
## mh 0.7407533 0.57331576 0.70121523 0.7577972 0.04231706 1.00000000
dplyr::select(health_tbl,-gender) %>% apa.cor.table(.,filename="./output/Cortable.doc",table.number=1)
##
##
## Table 1
##
## Means, standard deviations, and correlations with confidence intervals
##
##
##   Variable     M      SD    1            2            3            4            5
##   1. mh1       2.47   1.21
##
##   2. mh2       2.98   0.97  .19**
##                             [.09, .28]
##
##   3. mh3       2.30   0.94  .33**        .28**
##                             [.24, .42]   [.18, .37]
##
##   4. mh4       2.56   1.14  .43**        .21**        .41**
##                             [.34, .51]   [.12, .31]   [.33, .49]
##
##   5. jailage   19.16  3.42  .07          -.04         .04          .02
##                             [-.03, .17]  [-.14, .06]  [-.06, .14]  [-.08, .12]
##
##   6. mh        2.58   0.74  .74**        .57**        .70**        .76**        .04
##                             [.69, .78]   [.50, .64]   [.65, .75]   [.71, .80]   [-.06, .14]
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
## * indicates p < .05. ** indicates p < .01.
##
model_1<- lm(mh ~ jailage + gender, data = health_tbl)
par(mfrow = c(2,2)) # show all four diagnostic plots in a single panel
plot(model_1)
par(mfrow=c(1,1)) # reset the plotting layout
tidy(summary(model_1))
model1_augment<-augment(model_1)
ggplot(model1_augment,aes(x=jailage, y=mh,color=gender)) +
geom_line(aes(y=.fitted))
model2<- lm(mh ~ jailage + gender + jailage:gender, data=health_tbl)
par(mfrow=c(2,2))
plot(model2)
par(mfrow=c(1,1))
tidy(summary(model2))
ggplot(health_tbl, aes(x=jailage, y=mh, color=gender, group=gender)) +
geom_smooth(method= "lm",se=F)
anova(model_1,model2)
summary(model2)$r.squared-summary(model_1)$r.squared
## [1] 0.002434573
I’m going to do the same thing that I did above, except this will be an ANOVA. Yes, it’s a tour of the GLM.
health_tbl<- read_tsv("./data/w4inhome_dvn.tab") %>%
  transmute(admin_month = IMONTH4,
            gender = factor(BIO_SEX4,levels = c(1,2),labels= c("male","female")),
            living_mother = factor(H4WP1,levels= c(0,1,8),labels=c("No","Yes","Don't Know")),
            fiw = replace(H4LM29, H4LM29 >=6,NA)) # values of 6 and up are missing-data codes
ggpairs(health_tbl)
options(contrasts = c("contr.sum", "contr.poly")) # sum-to-zero contrasts, needed for meaningful Type III sums of squares
linear_model<- lm(fiw ~ admin_month + living_mother + gender, data = health_tbl)
anova_model<- Anova(linear_model,type = 3)
anova_model
kable(anova_model, digits = 3, format = "html") %>% save_kable("./output/anova_model.htm") # kable() has no out argument; kableExtra::save_kable() writes the rendered table to a file
|               | Sum Sq   | Df   | F value  | Pr(>F) |
|---------------|----------|------|----------|--------|
| (Intercept)   | 3235.235 | 1    | 4298.538 | 0.000  |
| admin_month   | 3.408    | 1    | 4.529    | 0.033  |
| living_mother | 2.441    | 2    | 1.622    | 0.198  |
| gender        | 41.747   | 1    | 55.468   | 0.000  |
| Residuals     | 3778.233 | 5020 | NA       | NA     |
plot(linear_model)
pirateplot(formula = fiw ~ living_mother + gender, data = health_tbl) # pirateplot expects an unquoted formula
mm_df<- tidy(lsmeans(linear_model,"gender",by="living_mother"))
ggplot(mm_df,
aes(x=living_mother,
y=estimate,
color=gender,
group=gender)) +
geom_line()
health_tbl %<>% mutate(cond=interaction(gender,living_mother,sep=" x ")) # single factor crossing gender and living_mother
posthoc_model<- lm(fiw~admin_month + cond, data = health_tbl)
posthocs <- glht(posthoc_model, linfct=mcp(cond="Tukey")) # Tukey-adjusted pairwise comparisons on the combined factor
summary(posthocs)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: lm(formula = fiw ~ admin_month + cond, data = health_tbl)
##
## Linear Hypotheses:
## Estimate Std. Error t value
## female x No - male x No == 0 -0.32298 0.10918 -2.958
## male x Yes - male x No == 0 0.02272 0.08075 0.281
## female x Yes - male x No == 0 -0.15289 0.08052 -1.899
## male x Don't Know - male x No == 0 0.09060 0.20497 0.442
## female x Don't Know - male x No == 0 -0.14135 0.19725 -0.717
## male x Yes - female x No == 0 0.34571 0.07805 4.429
## female x Yes - female x No == 0 0.17009 0.07775 2.188
## male x Don't Know - female x No == 0 0.41359 0.20393 2.028
## female x Don't Know - female x No == 0 0.18163 0.19615 0.926
## female x Yes - male x Yes == 0 -0.17562 0.02541 -6.911
## male x Don't Know - male x Yes == 0 0.06788 0.19023 0.357
## female x Don't Know - male x Yes == 0 -0.16407 0.18186 -0.902
## male x Don't Know - female x Yes == 0 0.24349 0.19011 1.281
## female x Don't Know - female x Yes == 0 0.01154 0.18171 0.064
## female x Don't Know - male x Don't Know == 0 -0.23195 0.26186 -0.886
## Pr(>|t|)
## female x No - male x No == 0 0.0269 *
## male x Yes - male x No == 0 0.9997
## female x Yes - male x No == 0 0.3414
## male x Don't Know - male x No == 0 0.9970
## female x Don't Know - male x No == 0 0.9730
## male x Yes - female x No == 0 <0.001 ***
## female x Yes - female x No == 0 0.1963
## male x Don't Know - female x No == 0 0.2703
## female x Don't Know - female x No == 0 0.9216
## female x Yes - male x Yes == 0 <0.001 ***
## male x Don't Know - male x Yes == 0 0.9989
## female x Don't Know - male x Yes == 0 0.9293
## male x Don't Know - female x Yes == 0 0.7494
## female x Don't Know - female x Yes == 0 1.0000
## female x Don't Know - male x Don't Know == 0 0.9343
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
etaSquared(linear_model, type=3,anova=T)
## eta.sq eta.sq.part SS df MS F
## admin_month 0.0008912691 0.0009012830 3.408329 1 3.4083293 4.528522
## living_mother 0.0006383947 0.0006457325 2.441305 2 1.2206523 1.621836
## gender 0.0109167729 0.0109286326 41.747165 1 41.7471646 55.467924
## Residuals 0.9879980344 NA 3778.233464 5020 0.7526361 NA
## p
## admin_month 3.338298e-02
## living_mother 1.976392e-01
## gender 1.112443e-13
## Residuals NA
linear_model_lsm<- lsmeans(linear_model,"living_mother")
# weights of -.5, -.5, 1 compare "Don't Know" against the average of "No" and "Yes"
contrast(linear_model_lsm, list(knowledge=c(-.5,-.5,1)))
## contrast estimate SE df t.ratio p.value
## knowledge 0.0879 0.134 5020 0.657 0.5111
##
## Results are averaged over the levels of: gender